Abstract

We consider the problem of indexing a string t of length n to report the occurrences of a query
pattern p containing m characters and j wildcards. Let occ be the number of occurrences of p
in t, and $\sigma$ the size of the alphabet. We obtain the following results.

• A linear space index with query time $O(m + \sigma^j \log\log n + occ)$. This significantly improves
the previously best known linear space index by Lam et al. [ISAAC 2007], which requires
query time $\Theta(jn)$ in the worst case.

• An index with query time $O(m + j + occ)$ using space $O(\sigma^{k^2} n \log^k \log n)$, where k is the
maximum number of wildcards allowed in the pattern. This is the first non-trivial bound
with this query time.

• A time-space trade-off, generalizing the index by Cole et al. [STOC 2004]. 

We also show that these indexes can be generalized to allow variable length gaps in the pattern. 
Our results are obtained using a novel combination of well-known and new techniques, which 
could be of independent interest. 

1 Introduction 

The string indexing problem is to build an index for a string t such that the occurrences of a 
query pattern p can be reported. The classic suffix tree data structure [38] combined with perfect 
hashing [15] gives a linear space solution for string indexing with optimal query time, i.e., an 
$O(n)$ space data structure that supports queries in $O(m + occ)$ time, where occ is the number of
occurrences of p in t. 

Recently, various extensions of the classic string indexing problem that allow errors or wildcards 
(also known as gaps or don't cares) have been studied [6, 11, 24, 28, 32, 36, 37]. In this paper, we focus
on one of the most basic of these extensions, namely, string indexing for patterns with wildcards. 
In this problem, only the pattern contains wildcards, and the goal is to report all occurrences of p 
in t, where a wildcard is allowed to match any character in t. 

String indexing for patterns with wildcards finds several natural applications in large-scale data 
processing areas such as information retrieval, bioinformatics, data mining, and internet traffic 
analysis. For instance in bioinformatics, the PROSITE data base [5, 21] supports searching for
protein patterns containing wildcards. 

Preliminary version appeared in Proceedings of the 13th Scandinavian Symposium and Workshops on Algorithm
Theory. Lecture Notes in Computer Science, vol. 7357, pp. 283-294, Springer 2012.

Supported by a grant from the Danish Council for Independent Research | Natural Sciences.

Despite significant interest in the problem and its many variations, most of the basic questions
remain unsolved. We introduce three new indexes and obtain several new bounds for string indexing
with wildcards in the pattern. If the index can handle patterns containing an unbounded number
of wildcards, we call it an unbounded wildcard index, otherwise we refer to the index as a k-bounded
wildcard index, where k is the maximum number of wildcards allowed in p. Let n be the length of
the indexed string t, and $\sigma$ be the size of the alphabet. We define m and j to be the number of
characters and wildcards in p, respectively. Consequently, the length of p is m + j. We show that,

• There is an unbounded wildcard index with query time $O(m + \sigma^j \log\log n + occ)$ using
linear space. This significantly improves the previously best known linear space index by
Lam et al. [24], which requires query time $\Theta(jn)$ in the worst case. Compared to the index
by Cole et al. [11] having the same query time, we improve the space usage by a factor $\log n$.

• There is a k-bounded wildcard index with query time $O(m + j + occ)$ using space $O(\sigma^{k^2} n \log^k \log n)$.
This is the first non-trivial space bound with this query time.

• There is a time-space trade-off for k-bounded wildcard indexes. This trade-off generalizes the
index described by Cole et al. [11].

Furthermore, we generalize these indexes to support variable length gaps in the pattern. 

1.1 Previous Work

Exact string matching has been generalized with error bounds in many different ways. In particular,
allowing matches within a bounded Hamming or edit distance, known as approximate string matching,
has been subject to much research [6, 10, 12, 19, 25, 26, 28, 32, 35, 37]. Another generalization
was suggested by Fischer and Paterson [14], allowing wildcards in the text or pattern.

Work on the wildcard problem has mostly focused on the non-indexing variant, where the string
t is not preprocessed in advance [4, 8, 9, 14, 23]. Some solutions to the indexing problem consider
the case where wildcards appear only in the indexed string [36] or in both the string and the
pattern [11, 24].

In the following, we summarize the known indexes that support wildcards in the pattern only. 
We focus on the case where k > 1, since for k = 0 the problem is classic string indexing. For
k = 1, Cole et al. [11] describe a selection of specialized solutions. However, these solutions do not
generalize to larger k. 

Several simple solutions to the problem exist for k > 1. Using a suffix tree T for t [38], we can
find all occurrences of p in a top-down traversal starting from the root. When we reach a wildcard
character in p in location $\ell \in T$, the search branches out, consuming the first character on all
outgoing edges from $\ell$. This gives an unbounded wildcard index using $O(n)$ space with query time
$O(\sigma^j m + occ)$, where occ is the total number of occurrences of p in t. Alternatively, we can build a
compressed trie storing all possible modifications of all suffixes of t containing at most k wildcards.
This gives a k-bounded wildcard index using $O(n^{k+1})$ space with query time $O(m + j + occ)$.
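
The branching search described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: it uses an uncompressed suffix trie stored as nested dictionaries (quadratic size, not the linear-size suffix tree), 0-based positions, and the hypothetical names `build_suffix_trie`, `collect` and `wildcard_search`.

```python
LEAF = "leaf"  # multi-character key, so it cannot collide with a single text character


def build_suffix_trie(t):
    """Uncompressed suffix trie of t as nested dicts (quadratic size, illustration only)."""
    root = {}
    for start in range(len(t)):
        node = root
        for ch in t[start:]:
            node = node.setdefault(ch, {})
        node[LEAF] = start  # 0-based start position of this suffix
    return root


def collect(node):
    """Return the start positions of all suffixes stored below node."""
    found, stack = [], [node]
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key == LEAF:
                found.append(child)
            else:
                stack.append(child)
    return found


def wildcard_search(trie, p, wildcard="*"):
    """Top-down search; a wildcard branches to every child of each current node."""
    frontier = [trie]
    for ch in p:
        nxt = []
        for node in frontier:
            if ch == wildcard:
                nxt.extend(child for key, child in node.items() if key != LEAF)
            elif ch in node:
                nxt.append(node[ch])
        frontier = nxt
    return sorted(pos for node in frontier for pos in collect(node))
```

Each wildcard can multiply the number of active locations by up to the alphabet size, which is where the $\sigma^j$ factor in the query time of the simple index comes from.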

In 2004, Cole et al. [11] gave an elegant k-bounded wildcard index using $O(n \log^k n)$ space
and with $O(m + 2^j \log\log n + occ)$ query time. For sufficiently small values of j this significantly
improves the previous bounds. The key components in this solution are a new data structure for
longest common prefix (LCP) queries and a heavy path decomposition [20] of the suffix tree for the
text t. Given a pattern p, the LCP data structure supports efficient insertion of all suffixes of p

Type        Query Time                                       Space                                 Solution

Unbounded   $O(m + \sum_{i=0}^{j} occ(p_i, t))$              $O(n)$                                Iliopoulos and Rahman [22]
            $O(m + j \min_{0 \le i \le j} occ(p_i, t))$      $O(n)$                                Lam et al. [24]
            $O(\sigma^j m + occ)$                            $O(n)$                                Simple suffix tree index †
            $O(m + \sigma^j \log\log n + occ)$               $O(n)$                                ART decomposition †
            $O(m + \sigma^j \log\log n + occ)$               $O(n \log n)$                         Cole et al. [11]

k-bounded   $O(m + \beta^j \log\log n + occ)$                $O(n \log(n) \log_\beta^{k-1} n)$     Heavy $\alpha$-tree decomposition †
            $O(m + 2^j \log\log n + occ)$                    $O(n \log^k n)$                       Cole et al. [11]
            $O(m + j + occ)$                                 $O(n \sigma^{k^2} \log^k \log n)$     Special index for $m < \sigma^k \log\log n$ †
            $O(m + j + occ)$                                 $O(n^{k+1})$                          Simple linear time index †

Table 1: † = presented in this paper. The term $occ(p_i, t)$ denotes the number of matches of $p_i$ in t
and is $\Theta(n)$ in the worst case.



into the suffix tree for t, such that subsequent longest common prefix queries between any pair
of suffixes from t and p can be answered in $O(\log\log n)$ time. This is where the $\log\log n$ term in
the query time comes from. The heavy path decomposition partitions the suffix tree into disjoint
heavy paths such that any root-to-leaf path contains at most a logarithmic number of heavy paths.
Cole et al. [11] show how to reduce the size of the index by only creating additional wildcard tries
for the off-path subtries. This leads to the $O(n \log^k n)$ space bound. Secondly, using the new tries,
the top-down search branches at most twice for each wildcard, leading to the $2^j$ term in the query
time. Though Cole et al. [11] did not consider unbounded wildcard indexes, the technique can be
extended to this case by using only the LCP data structure and omitting the additional wildcard
tries. This leads to an unbounded wildcard index with query time $O(m + \sigma^j \log\log n + occ)$ using
space $O(n \log n)$.

The solutions described by Cole et al. all have bounds which are exponential in the number
of wildcards in the pattern. Very recently, Lewenstein [27] used similar techniques to improve the
bounds to be exponential in the number of gaps in the pattern (a gap is a maximal substring of
consecutive wildcards). Assuming that the pattern contains at most g gaps each of size at most
G, Lewenstein obtains a bounded index with query time $O(m + 2^\gamma \log\log n + occ)$ using space
$O(n (G^2 \log n)^g)$, where $\gamma \le g$ is the number of gaps in the pattern.

A different approach was taken by Iliopoulos and Rahman [22], who describe an unbounded
wildcard index using linear space. For a pattern p consisting of strings $p_0, p_1, \ldots, p_j$ (subpatterns)
interleaved by j wildcards, the query time of the index is $O(m + \sum_{i=0}^{j} occ(p_i, t))$, where $occ(p_i, t)$
denotes the number of matches of $p_i$ in t. This was later improved by Lam et al. [24] with an index
that determines complete matches by first identifying potential matches of the subpatterns in t and
subsequently verifying each possible match for validity using interval stabbing on the subpatterns.
Their solution is an unbounded wildcard index with query time $O(m + j \min_{0 \le i \le j} occ(p_i, t))$ using
linear space. However, both of these solutions have a worst case query time of $\Theta(jn)$, since there
may be $\Theta(n)$ matches for a subpattern, but no matches of p. Table 1 summarizes the existing
solutions for the problem in relation to our results.

The unbounded wildcard index by Iliopoulos and Rahman [22] was the first index to achieve 
query time linear in m while using O(n) space. Recently, Chan et al. [6] considered the related 
problem of obtaining a k-mismatch index supporting queries in time linear in m and using O(n)
space. They describe an index with a query time of $O(m + (\log n)^{k(k+1)} \log\log n + occ)$. However,
this bound assumes a constant-size alphabet and a constant number of errors. In this paper we 
make no assumptions on the size of these parameters. 

1.2 Our Results



Our main contribution is three new wildcard indexes. 

Theorem 1. Let t be a string of length n from an alphabet of size $\sigma$. There is an unbounded
wildcard index for t using $O(n)$ space. The index can report the occurrences of a pattern with m
characters and j wildcards in time $O(m + \sigma^j \log\log n + occ)$.

Compared to the solution by Cole et al. [11], we obtain the same query time while reducing the
space usage by a factor $\log n$. We also significantly improve upon the previously best known linear
space index by Lam et al. [24], as we match the linear space usage while improving the worst-case
query time from $\Theta(jn)$ to $O(m + \sigma^j \log\log n + occ)$ provided $j \le \log_\sigma n$. Our solution is faster than
the simple suffix tree index for $m = \Omega(\log\log n)$. Thus, for sufficiently small j we improve upon
the previously known unbounded wildcard indexes.

The main idea of the solution is to combine an ART decomposition [1] of the suffix tree for t 
with the LCP data structure. The suffix tree is decomposed into a number of logarithmic-sized 
bottom trees and a single top tree. We introduce a new variant of the LCP data structure for use 
on the bottom trees, which supports queries in logarithmic time and linear space. The logarithmic 
size of the bottom trees leads to LCP queries in time O (log log n). On the top tree we use the 
LCP data structure by Cole et al. to answer queries in time O(loglogn). The number of LCP 
queries performed during a search for p is $O(\sigma^j)$, yielding the $\sigma^j \log\log n$ term in the query time.
The reduced size of the top tree causes the index to be linear in size. 

Theorem 2. Let t be a string of length n from an alphabet of size $\sigma$. For $2 \le \beta < \sigma$, there is a
k-bounded wildcard index using $O(n \log(n) \log_\beta^{k-1} n)$ space. The index can report the occurrences in
t of a pattern with m characters and $j \le k$ wildcards in time $O(m + \beta^j \log\log n + occ)$.

The theorem provides a time-space trade-off for k-bounded wildcard indexes. Compared to
the index by Cole et al. [11], we reduce the space usage by a factor $\log^{k-1} \beta$ by increasing the
branching factor from 2 to $\beta$. For $\beta = 2$ the index is identical to the index by Cole et al. The
result is obtained by generalizing the wildcard index described by Cole et al. We use a heavy
$\alpha$-tree decomposition, which is a new technique generalizing the classic heavy path decomposition
by Harel and Tarjan [20]. This decomposition could be of independent interest. We also show that
for $\beta = 1$ the same technique yields an index with query time $O(m + j + occ)$ using space $O(n h^k)$,
where h is the height of the suffix tree for t.

Theorem 3. Let t be a string of length n from an alphabet of size $\sigma$. There is a k-bounded wildcard
index for t using $O(\sigma^{k^2} n \log^k \log n)$ space. The index can report the occurrences of a pattern with
m characters and $j \le k$ wildcards in time $O(m + j + occ)$.

To our knowledge this is the first linear time index with a non-trivial space bound. The result
improves upon the space usage of the simple linear time index when $\sigma^{k^2} < n^k / \log^k \log n$. To achieve
this result, we use the $O(n h^k)$ space index to obtain a black-box reduction that can produce a
linear time index from an existing index. The idea is to build the $O(n h^k)$ space index with support
for short patterns, and query another index if the pattern is long. This technique is closely related
to the concept of filtering search introduced by Chazelle [7] and has previously been applied for
indexing problems [2, 6]. The theorem follows from applying the black-box reduction to the index
of Theorem 1.
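
The condition under which the index of Theorem 3 uses less space than the simple linear time index can be read off by comparing the two space bounds. The short calculation below is a sketch that ignores constant factors (a simplification made here, not a statement from the analysis itself).

```latex
\sigma^{k^2}\, n \log^k \log n < n^{k+1}
\iff \sigma^{k^2} \log^k \log n < n^{k}
\iff \sigma^{k^2} < \frac{n^{k}}{\log^k \log n}.
```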

1.2.1 Variable Length Gaps

We also show that the three indexes support searching for query patterns with variable length gaps,
i.e., patterns of the form $p = p_0 *\{a_1, b_1\}\, p_1 *\{a_2, b_2\} \cdots *\{a_j, b_j\}\, p_j$, where $*\{a_i, b_i\}$ denotes a
variable length gap that matches an arbitrary substring of length between $a_i$ and $b_i$, both inclusive.

String indexing for patterns with variable length gaps has applications in information retrieval,
data mining and computational biology [16, 18, 29, 31, 33]. In particular, the PROSITE data base [5, 21]
uses patterns with variable length gaps to identify and classify protein sequences. The problem
is a generalization of string indexing for patterns with wildcards, since a wildcard * is equivalent
to the variable length gap *{1, 1}. Variable length gaps are also known as bounded wildcards, as a
variable length gap $*\{a_i, b_i\}$ can be regarded as a bounded sequence of wildcards.

String indexing for patterns with variable length gaps is equivalent to string indexing for patterns
with wildcards, with the addition of allowing optional wildcards in the pattern. An optional wildcard
matches any character from $\Sigma$ or the empty string, i.e., an optional wildcard is equivalent to the
variable length gap *{0, 1}. Conversely, we may also consider a variable length gap $*\{a_i, b_i\}$ as $a_i$
consecutive wildcards followed by $b_i - a_i$ consecutive optional wildcards.
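
The rewriting of gaps into normal and optional wildcards can be illustrated with a small sketch. The function names `expand_gaps` and `resolve_optional` are hypothetical, and the sketch assumes that the characters `*` (wildcard) and `?` (optional wildcard) do not occur in the alphabet. The second function naively enumerates every way of keeping or dropping the optional wildcards, which is one simple way to see why up to $2^{B-A}$ wildcard-only patterns can arise; it is not claimed to be how the indexes are actually searched.

```python
from itertools import product


def expand_gaps(subpatterns, gaps):
    """Rewrite p0 *{a1,b1} p1 ... *{aj,bj} pj using '*' and '?' symbols.

    subpatterns is [p0, ..., pj] and gaps is [(a1, b1), ..., (aj, bj)].
    """
    if len(subpatterns) != len(gaps) + 1:
        raise ValueError("need exactly one gap between consecutive subpatterns")
    parts = [subpatterns[0]]
    for (a, b), sub in zip(gaps, subpatterns[1:]):
        if not 0 <= a <= b:
            raise ValueError("gap bounds must satisfy 0 <= a <= b")
        parts.append("*" * a + "?" * (b - a))  # a wildcards, then b - a optional ones
        parts.append(sub)
    return "".join(parts)


def resolve_optional(pattern):
    """Distinct wildcard-only patterns obtained by keeping or dropping each '?'."""
    slots = [i for i, ch in enumerate(pattern) if ch == "?"]
    results = set()
    for choice in product((True, False), repeat=len(slots)):
        keep = dict(zip(slots, choice))
        results.add("".join(
            "*" if ch == "?" and keep[i] else ch
            for i, ch in enumerate(pattern)
            if ch != "?" or keep[i]
        ))
    return sorted(results)
```

For example, the gap *{1, 3} between `ab` and `c` becomes `ab*??c`, which resolves to the three fixed-length patterns with one, two and three wildcards.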

Lam et al. [24] introduced optional wildcards in the pattern and presented a variant of their
solution for the string indexing for patterns with wildcards problem. The idea is to determine
potential matches and verify complete matches using interval stabbing on the possible positions
for the subpatterns. This leads to an unbounded optional wildcard index with query time
$O(m + Bj \min_{0 \le i \le j} occ(p_i, t))$ and space usage $O(n)$. Here $B = \sum_{i=1}^{j} b_i$ and $occ(p_i, t)$ denotes the number
of matches of $p_i$ in t, and since $occ(p_i, t) = \Theta(n)$ in the worst case, the worst case query time
is $\Theta(Bjn)$. Recently, Lewenstein [27] considered the special case where the pattern contains at
most g gaps and $a_i = b_i \le G$ for all i, i.e., the gaps are non-variable and of length at most G.
Using techniques similar to those by Cole et al. [11], he gave a bounded index with query time
$O(m + 2^\gamma \log\log n + occ)$ using space $O(n (G^2 \log n)^g)$, where $\gamma \le g$ is the number of gaps in the
pattern.

The related string matching with variable length gaps problem, where the text may not be
preprocessed in advance, has received some research attention recently [13, 30, 33]. However,
none of the results and techniques developed for this problem appear to lead to non-trivial bounds
for the indexing problem.

Our Results for Variable Length Gaps. To introduce our results we let $A = \sum_{i=1}^{j} a_i$ and
$B = \sum_{i=1}^{j} b_i$ denote the sum of the lower and upper bounds on the variable length gaps in p,
respectively. Hence A and $B - A$ denote the number of normal and optional wildcards in p,
respectively. A wildcard index with support for optional wildcards is called an optional wildcard
index. As for wildcard indexes, we distinguish between bounded and unbounded optional wildcard
indexes. A (k, o)-bounded optional wildcard index supports patterns containing $A \le k$ normal
wildcards and $B - A \le o$ optional wildcards. An unbounded optional wildcard index supports
patterns with no restriction on the number of normal and optional wildcards.

To accommodate for variable length gaps in the pattern, we only need to modify the way in 
which the wildcard indexes are searched, leading to the following new theorems. The proofs are 
given in Section 7.

Theorem 4. Let t be a string of length n from an alphabet of size $\sigma$. There is an unbounded
optional wildcard index for t using $O(n)$ space. The index can report the occurrences of a pattern
with m characters, A wildcards and $B - A$ optional wildcards in time $O(m + 2^{B-A} \sigma^B \log\log n + occ)$,
where $A = \sum_{i=1}^{j} a_i$ and $B = \sum_{i=1}^{j} b_i$.

Theorem 5. Let t be a string of length n from an alphabet of size $\sigma$. For $2 \le \beta < \sigma$, there is
a (k, o)-bounded optional wildcard index for t using $O(n \log(n) \log_\beta^{k+o-1} n)$ space. The index can
report the occurrences of a pattern with m characters, $A \le k$ wildcards and $B - A \le o$ optional
wildcards in time $O(m + 2^{B-A} \beta^B \log\log n + occ)$, where $A = \sum_{i=1}^{j} a_i$ and $B = \sum_{i=1}^{j} b_i$.

Theorem 6. Let t be a string of length n from an alphabet of size $\sigma$. There is a (k, o)-bounded
optional wildcard index for t using $O(\sigma^{(k+o)^2} n \log^{k+o} \log n)$ space. The index can report the
occurrences of a pattern with m characters, $A \le k$ wildcards and $B - A \le o$ optional wildcards in time
$O(2^{B-A}(m + B) + occ)$, where $A = \sum_{i=1}^{j} a_i$ and $B = \sum_{i=1}^{j} b_i$.

These results completely generalize our previous solutions, since if the query pattern only contains
variable length gaps of the form *{1, 1}, the problem reduces to string indexing for patterns with
wildcards. In that case A = B = j and we obtain exactly Theorem 1, Theorem 2 and Theorem 3.
Compared to the only known index for the problem by Lam et al. [24], Theorem 4 gives an
unbounded optional wildcard index that matches the $O(n)$ space usage, but improves the worst-case
query time from $\Theta(Bjn)$ to $O(m + 2^{B-A} \sigma^B \log\log n + occ)$, provided that $B \le \log_\sigma \sqrt{n}$.



2 Preliminaries 

We introduce the following notation. Let $p = p_0 * p_1 * \ldots * p_j$ be a pattern consisting of j + 1 strings
$p_0, p_1, \ldots, p_j \in \Sigma^*$ (subpatterns) interleaved by $j \le k$ wildcards. The substring starting at position
$\ell \in \{1, \ldots, n\}$ in t is an occurrence of p if and only if each subpattern $p_i$ matches the corresponding
substring in t. That is,

$$p_i = t\left[\ell + i + \sum_{r=0}^{i-1} |p_r|,\; \ell + i - 1 + \sum_{r=0}^{i} |p_r|\right] \quad \text{for } i = 0, 1, \ldots, j,$$

where t[i, j] denotes the substring of t between indices i and j, both inclusive. We define $t[i, j] = \varepsilon$
for i > j, t[i, j] = t[1, j] for i < 1 and t[i, j] = t[i, |t|] for j > |t|. Furthermore $m = \sum_{r=0}^{j} |p_r|$ is the
number of characters in p, and we assume without loss of generality that m > 0 and k > 0.
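
The definition of an occurrence can be checked directly against the text. The following sketch is a naive scan written only to make the index arithmetic concrete; the function name `occurrences` is hypothetical, positions are 1-based as in the definition, and the sketch additionally assumes that the whole pattern (including wildcards) must fit inside t.

```python
def occurrences(t, subpatterns):
    """Return the 1-based positions l at which p0 * p1 * ... * pj occurs in t."""
    j = len(subpatterns) - 1
    total = sum(len(sub) for sub in subpatterns) + j  # m + j, the length of p
    hits = []
    for l in range(1, len(t) - total + 2):
        prefix_len = 0  # sum of |p_r| for r < i
        ok = True
        for i, sub in enumerate(subpatterns):
            start = l + i + prefix_len               # l + i + sum_{r<i} |p_r|
            end = l + i - 1 + prefix_len + len(sub)  # l + i - 1 + sum_{r<=i} |p_r|
            if t[start - 1:end] != sub:              # t[start, end] in 1-based notation
                ok = False
                break
            prefix_len += len(sub)
        if ok:
            hits.append(l)
    return hits
```

For t = banana and p = a*a the two subpatterns are `a` and `a`, and the scan reports positions 2 and 4.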

Let $\mathrm{pref}_i(t) = t[1, i]$ and $\mathrm{suff}_i(t) = t[i, n]$ denote the prefix and suffix of t of length i and $n - i + 1$,
respectively. Omitting the subscripts, we let pref(t) and suff(t) denote the set of all non-empty
prefixes and suffixes of t, respectively. We extend the definitions of prefix and suffix to sets of
strings $S \subseteq \Sigma^*$ as follows.

$$\mathrm{pref}_i(S) = \{\mathrm{pref}_i(x) \mid x \in S\}, \qquad \mathrm{suff}_i(S) = \{\mathrm{suff}_i(x) \mid x \in S\},$$

$$\mathrm{pref}(S) = \bigcup_{x \in S} \mathrm{pref}(x), \qquad \mathrm{suff}(S) = \bigcup_{x \in S} \mathrm{suff}(x).$$



A set of strings S is prefix-free if no string in S is a prefix of another string in S. Any string set S 
can be made prefix-free by appending the same unique character $\$ \notin \Sigma$ to each string in S.

2.1 Trees and Tries



For a tree T, the root is denoted root(T), while height(T) is the number of edges on a longest path
from root(T) to a leaf of T. A compressed trie T(S) is a tree storing a prefix-free set of strings
$S \subseteq \Sigma^*$. The edges are labeled with substrings of the strings in S, such that a path from the root
to a leaf corresponds to a unique string in S. All internal vertices (except the root) have at least 
two children, and all labels on the outgoing edges of a vertex have different initial characters. 

A location $\ell \in T(S)$ may refer to either a vertex or a position on an edge in T(S). Formally,
$\ell = (v, s)$ where v is a vertex in T(S) and $s \in \Sigma^*$ is a prefix of the label on an outgoing edge of
v. If $s = \varepsilon$, we also refer to $\ell$ as an explicit vertex, otherwise $\ell$ is called an implicit vertex. There
is a one-to-one mapping between locations in T(S) and unique prefixes in pref(S). The prefix
$x \in \mathrm{pref}(S)$ corresponding to a location $\ell \in T(S)$ is obtained by concatenating the edge labels on
the path from root(T(S)) to $\ell$. Consequently, we use x and $\ell$ interchangeably, and we let $|\ell| = |x|$
denote the length of x. Since S is assumed prefix-free, each leaf of T(S) is a string in S, and
conversely. The suffix tree for t denotes the compressed trie over all suffixes of t, i.e., T(suff(t)).
We define $T_\ell(S)$ as the subtrie of T(S) rooted at $\ell$. That is, $T_\ell(S)$ contains the suffixes of strings
in T(S) starting from $\ell$. Formally, $T_\ell(S) = T(S_\ell)$, where

$$S_\ell = \{\mathrm{suff}_{|\ell|}(x) \mid x \in S \wedge \mathrm{pref}_{|\ell|}(x) = \ell\}.$$



2.2 Heavy Path Decomposition 

For a vertex v in a rooted tree T, we define weight(v) to be the number of leaves in $T_v$, where
$T_v$ denotes the subtree rooted at v. We define weight(T) = weight(root(T)). The heavy path
decomposition of T, introduced by Harel and Tarjan [20], classifies each edge as either light or
heavy. For each vertex $v \in T$, we classify the edge going from v to its child of maximum weight
(breaking ties arbitrarily) as heavy. The remaining edges are light. This construction has the
property that on a path from the root to any vertex, O(log(weight(T))) heavy paths are traversed.
For a heavy path decomposition of a compressed trie T(S), we assume that the heavy paths are
extended such that the label on each light edge contains exactly one character.



3 The LCP Data Structure

Cole et al. introduced the Longest Common Prefix (LCP) data structure, which provides a
way to traverse a compressed trie without tracing the query string one character at a time. In this
section we give a brief, self-contained description of the data structure and show a new property
that is essential for obtaining Theorem 1.

The LCP data structure stores a collection of compressed tries $T(C_1), T(C_2), \ldots, T(C_q)$ over
the string sets $C_1, C_2, \ldots, C_q \subseteq \Sigma^*$. Each $C_i$ is a set of substrings of the indexed string t. The
purpose of the LCP data structure is to support LCP queries

$\mathrm{LCP}(x, i, \ell)$: Returns the location in $T(C_i)$ where the search for the string $x \in \Sigma^*$ stops when
starting in location $\ell \in T(C_i)$.

If $\ell$ is the root of $T(C_i)$, we refer to the above LCP query as a rooted LCP query. Otherwise the
query is called an unrooted LCP query. In addition to the compressed tries $T(C_1), \ldots, T(C_q)$, the
LCP data structure also stores the suffix tree for t, denoted T(C) where C = suff(t). The following
lemma is implicit in the paper by Cole et al. [11].
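
To make the semantics of an LCP query concrete, the sketch below answers it naively by walking an uncompressed trie one character at a time. This is exactly the character-by-character tracing that the LCP data structure is designed to avoid, so it illustrates only what is returned, not how the stated time bounds are achieved. The names `build_trie` and `naive_lcp` and the nested-dictionary representation are assumptions made for this illustration.

```python
def build_trie(strings):
    """Uncompressed trie over the given strings, stored as nested dicts."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root


def naive_lcp(start, x):
    """Walk x downward from the trie node `start`.

    Returns the node where the search stops and the number of characters matched.
    Passing the trie root gives a rooted query, any other node an unrooted one.
    """
    node, matched = start, 0
    for ch in x:
        if ch not in node:
            break
        node = node[ch]
        matched += 1
    return node, matched
```

The walk takes time proportional to the number of matched characters, in contrast to the bounds stated in the lemmas of this section.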

Lemma 1 (Cole et al. [11]). Provided x has been preprocessed in time $O(|x|)$, the LCP data
structure can answer rooted LCP queries on $T(C_i)$ for any suffix of x in time $O(\log\log |C|)$ using
space $O(|C| + \sum_{i=1}^{q} |C_i|)$. Unrooted LCP queries on $T(C_i)$ can be performed in time $O(\log\log |C|)$
using $O(|C_i| \log |C_i|)$ additional space.

We extend the LCP data structure by showing that support for slower unrooted LCP queries
on a compressed trie $T(C_i)$ can be added using linear additional space.

Lemma 2. Unrooted LCP queries on $T(C_i)$ can be performed in time $O(\log |C_i| + \log\log |C|)$ using
$O(|C_i|)$ additional space.

Proof. We initially create a heavy path decomposition for all compressed tries $T(C_1), \ldots, T(C_q)$.
The search path for x starting in $\ell$ traverses a number of heavy paths in $T(C_i)$. Intuitively, an
unrooted LCP query can be answered by following the $O(\log |C_i|)$ heavy paths that the search
path passes through. For each heavy path, the next heavy path can be identified in constant time.
On the final heavy path, a predecessor query is needed to determine the exact location where the
search path stops.

For a heavy path H, we let h denote the distance that the search path for x follows H.
Cole et al. [11] showed that h can be determined in constant time by performing nearest common
ancestor queries on T(C). To answer $\mathrm{LCP}(x, i, \ell)$ we identify the heavy path H of $T(C_i)$ that
$\ell$ is part of and compute the distance h as described by Cole et al. If x leaves H on a light edge,
indexing distance h into H from $\ell$ yields an explicit vertex v. At v, a constant time lookup for
x[h + 1] determines the light edge on which x leaves H. Since the light edge has a label of length
one, the next location $\ell'$ on that edge is the root of the next heavy path. We continue the search for
the remaining suffix of x from $\ell'$ recursively by a new unrooted LCP query $\mathrm{LCP}(\mathrm{suff}_{h+2}(x), i, \ell')$. If
H is the heavy path on which the search for x stops, the location at distance h (i.e., the answer to
the original LCP query) is not necessarily an explicit vertex, and may not be found by indexing into
H. In that case a predecessor query for h is performed on H to determine the preceding explicit
vertex and thereby the location $\mathrm{LCP}(x, i, \ell)$. Answering an unrooted LCP query entails at most
$\log |C_i|$ recursive steps, each taking constant time. The final recursive step may require a predecessor
query taking time $O(\log\log |C|)$. Consequently, an unrooted LCP query can be answered in
time $O(\log |C_i| + \log\log |C|)$ using $O(|C_i|)$ additional space to store the predecessor data structures
for each heavy path. □

4 An Unbounded Wildcard Index Using Linear Space 

In this section we show how to obtain Theorem 1 by applying an ART decomposition on the suffix
tree for t and storing the top and bottom trees in the LCP data structure. 

4.1 ART Decomposition 

The ART decomposition introduced by Alstrup et al. [1] decomposes a tree into a single top tree
and a number of bottom trees. The construction is defined by two rules: 
1. A bottom tree is a subtree rooted in a vertex of minimal depth such that the subtree contains
no more than $\chi$ leaves.

2. Vertices that are not in any bottom tree make up the top tree. 

The decomposition has the following key property. 

Lemma 3 (Alstrup et al. [1]). The ART decomposition with parameter $\chi$ for a rooted tree T with
n leaves produces a top tree with at most $\frac{n}{\chi + 1}$ leaves.
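
The two rules can be turned into a short procedure. The sketch below is an illustration only: it assumes the tree is given as a dictionary mapping each vertex to its list of children, and the name `art_decomposition` is hypothetical. A vertex ends up in the top tree exactly when its subtree has more than $\chi$ leaves, and every child of a top tree vertex that is not itself in the top tree becomes the root of a bottom tree.

```python
def art_decomposition(children, root, chi):
    """Split a rooted tree into top tree vertices and bottom tree roots.

    children maps a vertex to the list of its children; leaves may be absent.
    Returns (top vertices, bottom tree roots, top tree leaves, number of leaves).
    """
    order, stack = [], [root]
    while stack:                       # preorder: parents before children
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))

    leaves = {}
    for v in reversed(order):          # children before parents
        kids = children.get(v, [])
        leaves[v] = sum(leaves[c] for c in kids) if kids else 1

    # A vertex belongs to the top tree exactly when its subtree has more than chi leaves.
    top = {v for v in order if leaves[v] > chi}
    if root not in top:
        bottom_roots = [root]
    else:
        bottom_roots = [c for v in order if v in top
                        for c in children.get(v, []) if c not in top]
    top_leaves = [v for v in order if v in top
                  and all(c not in top for c in children.get(v, []))]
    return top, bottom_roots, top_leaves, leaves[root]
```

Distinct top tree leaves have disjoint subtrees, each with at least $\chi + 1$ leaves, which is the counting argument behind the bound on the number of top tree leaves.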

4.2 Obtaining the Index 

Applying an ART decomposition on T(suff(t)) with $\chi = \log n$, we obtain a top tree T' and a
number of bottom trees $B_1, B_2, \ldots, B_q$ each of size at most $\log n$. From Lemma 3, T' has at most
$\frac{n}{\log n}$ leaves and hence $O(\frac{n}{\log n})$ vertices since T' is a compressed trie.

To facilitate the search, the top and bottom trees are stored in an LCP data structure, noting
that these compressed tries only contain substrings of t. Using Lemma 2 we add support for
unrooted $O(\log \chi + \log\log n) = O(\log\log n)$ time LCP queries on the bottom trees using O(n)
additional space in total. For the top tree we apply Lemma 1 to add support for unrooted LCP
queries in time $O(\log\log n)$ using $O(\frac{n}{\log n} \log \frac{n}{\log n}) = O(n)$ additional space. Since the branching
factor is not reduced, $O(\sigma^i)$ LCP queries, each taking time $O(\log\log n)$, are performed for the
subpattern $p_i$. This concludes the proof of Theorem 1.
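
The linear space bound for the top tree follows from a one-line estimate. The calculation below is a sketch that assumes $n \ge 2$ and base-2 logarithms, so that $\log\log n \ge 0$.

```latex
\frac{n}{\log n}\,\log\frac{n}{\log n}
  = \frac{n}{\log n}\,(\log n - \log\log n)
  \le \frac{n}{\log n}\cdot \log n
  = n .
```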

5 A Time-Space Trade-Off for k-Bounded Wildcard Indexes

In this section we will show Theorem 2. We first introduce the necessary constructions.

5.1 Heavy $\alpha$-Tree Decomposition

The heavy $\alpha$-tree decomposition is a generalization of the well-known heavy path decomposition
introduced by Harel and Tarjan [20]. The purpose is to decompose a rooted tree T into a number of
heavy trees joined by light edges, such that a path to the root of T traverses at most a logarithmic
number of heavy trees. For use in the construction, we define a proper weight function on the
vertices of T to be a function satisfying $\mathrm{weight}(v) \ge \sum_{w \text{ child of } v} \mathrm{weight}(w)$. Observe that using
the number of vertices or the number of leaves in the subtree rooted at v as the weight of v satisfies
this property. The decomposition is then constructed by classifying edges in T as being heavy or
light according to the following rule. For every vertex $v \in T$, the edges to the $\alpha$ heaviest children
of v (breaking ties arbitrarily) are heavy, and the remaining edges are light. For $\alpha = 1$ this results
in a heavy path decomposition. Given a heavy $\alpha$-tree decomposition of T, we define $\mathrm{lightdepth}_\alpha(v)$
to be the number of light edges on a path from the vertex $v \in T$ to the root of T. The key property
of this construction is captured by the following lemma.

Lemma 4. For any vertex v in a rooted tree T and $\alpha > 0$,

$$\mathrm{lightdepth}_\alpha(v) \le \log_{\alpha+1} \mathrm{weight}(\mathrm{root}(T)).$$

Figure 1: Two different heavy $\alpha$-tree decompositions with $\alpha = 2$ of a tree with n = 38 leaves. The
maximum light depth is 3 and 2, respectively, in agreement with Lemma 4.



Proof. Consider a light edge from a vertex v to its child w. We prove that $\mathrm{weight}(w) \le \frac{1}{\alpha+1} \mathrm{weight}(v)$,
implying that $\mathrm{lightdepth}_\alpha(v) \le \log_{\alpha+1} \mathrm{weight}(\mathrm{root}(T))$. To obtain a contradiction, suppose that
$\mathrm{weight}(w) > \frac{1}{\alpha+1} \mathrm{weight}(v)$. In addition to w, v must have $\alpha$ heavy children, each of which has a
weight greater than or equal to weight(w). Hence

$$\mathrm{weight}(v) \ge (1 + \alpha) \cdot \mathrm{weight}(w) > (1 + \alpha) \cdot \frac{1}{\alpha + 1} \mathrm{weight}(v) = \mathrm{weight}(v),$$

which is a contradiction. □

Lemma 4 holds for any heavy $\alpha$-tree decomposition obtained using a proper weight function on
T. In the remaining part of the paper we will assume that the weight of a vertex is the number
of leaves in the subtree rooted at v. See Figure 1 for two different examples of heavy $\alpha$-tree
decompositions.
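
The decomposition rule and the light depth bound of Lemma 4 can be checked on small trees with the following sketch. It assumes the tree is given as a dictionary from vertices to child lists, uses the number of leaves as the weight, and breaks ties by the order in which Python's stable sort leaves equal-weight children; the name `light_depths` is hypothetical.

```python
def light_depths(children, root, alpha):
    """Return (weight, lightdepth) dictionaries for a heavy alpha-tree decomposition."""
    order, stack = [], [root]
    while stack:                       # preorder: parents before children
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))

    weight = {}
    for v in reversed(order):          # children before parents
        kids = children.get(v, [])
        weight[v] = sum(weight[c] for c in kids) if kids else 1

    depth = {root: 0}
    for v in order:
        # The alpha heaviest children are reached by heavy edges, the rest by light edges.
        ranked = sorted(children.get(v, []), key=lambda c: weight[c], reverse=True)
        for rank, c in enumerate(ranked):
            depth[c] = depth[v] + (0 if rank < alpha else 1)
    return weight, depth
```

The bound can be verified without floating point arithmetic by checking that $(\alpha + 1)^d \le \mathrm{weight}(\mathrm{root}(T))$ for the maximum light depth d. On a complete ternary tree with 9 leaves and $\alpha = 2$ the bound is tight, since d = 2 and $3^2 = 9$.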

We define $\mathrm{lightheight}_\alpha(T)$ to be the maximum light depth of a vertex in T, and remark that for
$\alpha = 0$, $\mathrm{lightheight}_\alpha(T) = \mathrm{height}(T)$. For a vertex v in a compressed trie T(S), we let lightstrings(v)
denote the set of strings starting in one of the light edges leaving v. That is, lightstrings(v) is the
union of the set of strings in the subtries $T_\ell(S)$ where $\ell$ is the first location on a light outgoing edge
of v, i.e., $|\ell| = |v| + 1$.



5.2 Wildcard Trees 

We introduce the $(\beta, k)$-wildcard tree, denoted $T_\beta^k(C)$, where $1 \le \beta < \sigma$ is a chosen parameter. This
data structure stores a collection of strings $C \subseteq \Sigma^+$ in a compressed trie such that the search for
a pattern p with at most k wildcards branches to at most $\beta$ locations in $T_\beta^k(C)$ when consuming
a single wildcard of p. In particular for $\beta = 1$, the search for p never branches and the search
time becomes linear in the length of p. For a vertex v, we define the wildcard height of v to be
the number of wildcards on the path from v to the root. Intuitively, given a wildcard tree that
supports i wildcards, support for an extra wildcard is added by joining a new tree to each vertex v
with wildcard height i by an edge labeled *. This tree is searched if a wildcard is consumed in v.
Formally, $T_\beta^k(C)$ is built recursively as follows.

Construction of $T_\beta^i(S)$: Produce a heavy $(\beta - 1)$-tree decomposition of T(S), then
for each internal vertex $v \in T(S)$ join v to the root of $T_\beta^{i-1}(\mathrm{suff}_2(\mathrm{lightstrings}(v)))$ by an
edge labeled *. Let $T_\beta^0(S) = T(S)$.
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The construction is illustrated in Figure 2 Since a leaf £ in a compressed trie T(S) is obtained 
as the suffix of a string x £ C, we assume that t inherits the label of x in case the strings in C' 
are labeled. For example, when C denotes the suffixes of t, we will label each suffix in C with its 



start position in t. This immediately provides us with a /abounded wildcard index. Figure 3 shows 
some concrete examples of the construction of Tg(C) when C is a set of labeled suffixes. 




Figure 2: Illustration of the recursive construction of the wildcard tree $T_\beta^k(C)$. The final tree consists of $k$ layers of compressed tries joined by edges labeled *.



5.3 Wildcard Tree Index 

Given a collection $C$ of strings and a pattern $p$, we can identify the strings of $C$ having a prefix matching $p$ by constructing $T_\beta^k(C)$. Searching $T_\beta^k(C)$ is similar to the suffix tree search, except when consuming a wildcard character of $p$ in an explicit vertex $v \in T_\beta^k(C)$ with more than $\beta$ children. In that case the search branches to the root of the wildcard tree joined to $v$ and to the first location on the $\beta - 1$ heavy edges of $v$, effectively letting the wildcard match the first character on all edges from $v$. Consequently, the search for $p$ branches to a total of at most $\sum_{i=0}^{j} \beta^i = O(\beta^j)$ locations, each of which requires $O(m)$ time, resulting in a query time $O(\beta^j m + \mathrm{occ})$. For $\beta = 1$ the query time is $O(m + j + \mathrm{occ})$.
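
The construction and search can be illustrated with a small Python sketch. It is a simplification under stated assumptions rather than the data structure analysed in the paper: it uses an uncompressed trie built by grouping strings on their first character, stores at every node the labels of all strings passing through it, uses the number of strings below a child as its weight, and only joins a wildcard tree to nodes with more than $\beta$ children. The names `Node`, `build` and `search` are hypothetical.

```python
class Node:
    def __init__(self, labels):
        self.labels = labels  # labels of all strings passing through this location
        self.children = {}    # first character -> child Node
        self.heavy = []       # the beta - 1 heaviest children (only set if self.wild exists)
        self.wild = None      # root of the wildcard tree joined by the edge labeled *

def build(items, beta, k):
    """Build a (beta, k)-wildcard tree over items = [(string, label), ...]."""
    node = Node([label for _, label in items])
    groups = {}
    for s, label in items:
        if s:
            groups.setdefault(s[0], []).append((s[1:], label))
    for ch, group in groups.items():
        node.children[ch] = build(group, beta, k)  # same layer, same k
    if k > 0 and len(groups) > beta:
        # Heavy (beta - 1)-decomposition: the beta - 1 largest groups are heavy.
        order = sorted(groups, key=lambda ch: len(groups[ch]), reverse=True)
        node.heavy = [node.children[ch] for ch in order[:beta - 1]]
        # groups[ch] already has the first character removed, i.e. suff_2.
        light = [item for ch in order[beta - 1:] for item in groups[ch]]
        node.wild = build(light, beta, k - 1)
    return node

def search(root, pattern, wildcard="*"):
    """Return the sorted labels of all stored strings with a prefix matching pattern."""
    frontier = [root]
    for ch in pattern:
        nxt = []
        for node in frontier:
            if ch != wildcard:
                if ch in node.children:
                    nxt.append(node.children[ch])
            elif node.wild is not None:
                nxt.extend(node.heavy)   # at most beta - 1 heavy edges
                nxt.append(node.wild)    # plus the edge labeled *
            else:
                nxt.extend(node.children.values())
        frontier = nxt
    return sorted(label for node in frontier for label in node.labels)

t = "bananas$"
suffixes = [(t[i:], i + 1) for i in range(len(t))]  # labeled by 1-based start position
print(search(build(suffixes, beta=2, k=2), "a*a"))  # [2, 4]
```

When a node has at most $\beta$ children, or when the pattern contains more wildcards than the tree supports, the sketch falls back to following every child, so the answer stays correct although the branching guarantee no longer applies in the latter case.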

Lemma 5. For any integer $1 \le \beta < \sigma$, the wildcard tree $T_\beta^k(C)$ has query time $O(\beta^j m + j + \mathrm{occ})$. The wildcard tree stores $O(|C| H^k)$ strings, where $H$ is an upper bound on the light height of all compressed tries $T(S)$ satisfying $S \subseteq \mathrm{suff}_d(C)$ for some integer $d$.

Proof. We prove that the total number of strings (leaves) in $T_\beta^i(S)$, denoted $|T_\beta^i(S)|$, is at most $|S| \sum_{j=0}^{i} H^j = O(|S| H^i)$. The proof is by induction on $i$. The base case $i = 0$ holds, since $T_\beta^0(S) = T(S)$ contains $|S| = |S| \sum_{j=0}^{0} H^j$ strings. For the inductive step, assume that $|T_\beta^i(S)| \le |S| \sum_{j=0}^{i} H^j$. Let $S_v = \mathrm{suff}_2(\mathrm{lightstrings}(v))$ for a vertex $v \in T(S)$. From the construction we have that the number of strings in $T_\beta^{i+1}(S)$ is the number of strings in $T(S)$ plus the number of strings in the wildcard trees joined to the vertices of $T(S)$. That is,

$|T_\beta^{i+1}(S)| = |S| + \sum_{v \in T(S)} |T_\beta^i(S_v)| \le |S| + \sum_{v \in T(S)} |S_v| \sum_{j=0}^{i} H^j$.

The string sets $S_v$ consist of suffixes of strings in $S$. Consider a string $x \in S$, i.e., a leaf in $T(S)$. The number of times a suffix of $x$ appears in a set $S_v$ is equal to the light depth of $x$ in $T(S)$. $S$ is also a set of suffixes of $C$, and hence $H$ is an upper bound on the maximum light depth of $T(S)$. This establishes that $\sum_{v \in T(S)} |S_v| \le |S| H$, thus showing that $|T_\beta^{i+1}(S)| \le |S| + |S| H \sum_{j=0}^{i} H^j = |S| \sum_{j=0}^{i+1} H^j$. □

Figure 3: Showing $T_\beta^k(C)$ for $\beta \in \{1, 2, 3\}$, $k = 2$ and $C = \mathrm{suff}(\texttt{bananas\$})$. The recursion levels 0, 1, 2 in the construction are indicated by increasing growth rings in the vertices. All edges in $T_1^k(C)$ are light, since the construction is based on a heavy $\alpha$-tree decomposition with $\alpha = \beta - 1 = 0$. Leaves are labeled with the start position of their corresponding suffix in $t$.

Constructing the wildcard tree $T_\beta^k(C)$, where $C = \mathrm{suff}(t)$, we obtain a wildcard index with the following properties.

Lemma 6. Let $t$ be a string of length $n$ from an alphabet of size $\sigma$. For $2 \le \beta < \sigma$ there is a $k$-bounded wildcard index for $t$ using $O(n \log_\beta^k n)$ space. The index can report the occurrences of a pattern with $m$ characters and $j \le k$ wildcards in time $O(\beta^j m + \mathrm{occ})$.

Proof. The query time follows from Lemma 5. Since $T_\beta^k(C)$ is a compressed trie, and because each edge label is a substring of $t$, the space needed to store $T_\beta^k(C)$ is upper bounded by the number of strings it contains, which by Lemma 5 is $O(nH^k)$. It follows from Lemma 4 that $H = \log_\beta n$ is an upper bound on the light height of all compressed tries $T(S)$, since they each contain at most $n$ vertices. Consequently, the space needed to store the index is $O(n \log_\beta^k n)$. □

5.4 Wildcard Tree Index Using the LCP Data Structure 

The wildcard index of Lemma 6 reduces the branching factor of the suffix tree search from $\sigma$ to $\beta$, but still has the drawback that the search for a subpattern $p_i$ from a location $\ell \in T_\beta^k(C)$ takes $O(|p_i|)$ time. This can be addressed by combining the index with the LCP data structure as in Cole et al. [11]. In that way, the search for a subpattern can be done in time $O(\log\log n)$. The index is obtained by modifying the construction of $T_\beta^i(S)$ such that each $T(S)$ is added to the LCP data structure prior to joining the $(\beta, i-1)$-wildcard trees to the vertices of $T(S)$. For all $T(S)$ except the final $T(S) = T_\beta^0(S)$, support for unrooted LCP queries in time $O(\log\log n)$ is added using additional $O(|S| \log |S|)$ space. For the final $T(S)$, searched when all $k$ wildcards have been matched, we only need support for rooted queries. Upon receiving the query pattern $p = p_0 * p_1 * \cdots * p_j$, each $p_i$ is preprocessed in time $O(|p_i|)$ to support LCP queries for any suffix of $p_i$. The search for $p$ proceeds as described for the normal wildcard tree, except now rooted and unrooted LCP queries are used to search for suffixes of $p_0, p_1, \ldots, p_j$.

In the search for $p$, a total of at most $\sum_{i=0}^{j} \beta^i = O(\beta^j)$ LCP queries, each taking time $O(\log\log n)$, are performed. Preprocessing $p_0, p_1, \ldots, p_j$ takes $\sum_{i=0}^{j} |p_i| = m$ time, so the query time is $O(m + \beta^j \log\log n + \mathrm{occ})$. The space needed to store the index is $O(n \log_\beta^k n)$ for $T_\beta^k(C)$ plus the space needed to store the LCP data structure.

Adding support for rooted LCP queries requires linear space in the total size of the compressed tries, i.e., $O(n \log_\beta^k n)$. Let $T(S_0), T(S_1), \ldots, T(S_q)$ denote the compressed tries with support for unrooted LCP queries. Since each $S_i$ contains at most $n$ strings and $\sum_{i=0}^{q} |S_i| = |T_\beta^{k-1}(C)|$, by Lemma 1 the additional space required to support unrooted LCP queries is

$O\left(\sum_{i=0}^{q} |S_i| \log |S_i|\right) = O\left(\log n \sum_{i=0}^{q} |S_i|\right) = O\left(\log n \, |T_\beta^{k-1}(C)|\right) = O\left(n \log(n) \log_\beta^{k-1} n\right)$,

which is an upper bound on the total space required to store the wildcard index. This concludes the proof of Theorem 2. The $k$-bounded wildcard index described by Cole et al. [11] is obtained as a special case of Theorem 2.

Corollary 1 (Cole et al. [11]). Let $t$ be a string of length $n$ from an alphabet of size $\sigma$. There is a $k$-bounded wildcard index for $t$ using $O(n \log^k n)$ space. The index can report the occurrences of a pattern with $m$ characters and $j \le k$ wildcards in time $O(m + 2^j \log\log n + \mathrm{occ})$.



6 A k-Bounded Wildcard Index with Linear Query Time

Consider the $k$-bounded wildcard index obtained by creating the wildcard tree $T_1^k(\mathrm{suff}(t))$ for $t$. This index has linear query time, and we can show that the space usage depends on the height of the suffix tree.

Lemma 7. Let $t$ be a string of length $n$ from an alphabet of size $\sigma$. There is a $k$-bounded wildcard index for $t$ using $O(nh^k)$ space, where $h$ is the height of the suffix tree for $t$. The index can report the occurrences of a pattern with $m$ characters and $j$ wildcards in time $O(m + j + \mathrm{occ})$.

Proof. Since $\mathrm{suff}(t)$ is closed under the suffix operation, the height of $T(\mathrm{suff}(t))$ is an upper bound on the height of all compressed tries $T(S)$ satisfying $S \subseteq \mathrm{suff}_d(\mathrm{suff}(t))$ for some $d$. For $\beta = 1$, the light height of $T(S)$ is equal to the height of $T(S)$, so $H = h = \mathrm{height}(T(\mathrm{suff}(t)))$ can be used as an upper bound of the light height in Lemma 5, and consequently the space needed to store $T_1^k(\mathrm{suff}(t))$ is $O(nh^k)$. □

In the worst case the height of the suffix tree is close to n, but combining the index with another 
wildcard index yields a useful black box reduction. The idea is to query the first index if the pattern 
is short, and the second index if the pattern is long. 

Lemma 8. Let $F \ge m$ and let $G$ be independent of $m$ and $j$. Given a wildcard index $A$ with query time $O(F + G + \mathrm{occ})$ and space usage $S$, there is a $k$-bounded wildcard index $B$ with query time $O(F + j + \mathrm{occ})$ and taking space $O(n \min(G, h)^k + S)$, where $h$ is the height of the suffix tree for $t$.

Proof. The wildcard index $B$ consists of $A$ as well as a special wildcard index $C = T_1^k(\mathrm{pref}_G(\mathrm{suff}(t)))$, which is a wildcard tree with $\beta = 1$ over the set of all substrings of $t$ of length $G$. $G$ can be used as an upper bound for the light height in Lemma 5, so the space required to store $C$ is $O(n \min(G, h)^k)$ by using Lemma 7 if $G > h$. A query on $B$ results in a query on either $A$ or $C$. In case $G < F + j$, we query $A$ and the query time will be $O(F + G + \mathrm{occ}) = O(F + j + \mathrm{occ})$. In case $G \ge F + j$, we query $C$ with query time $O(m + j + \mathrm{occ}) = O(F + j + \mathrm{occ})$. In any case the query time of $B$ is $O(F + j + \mathrm{occ})$. □

Applying Lemma 8 with $F = m$ and $G = \sigma^k \log\log n$ on the unbounded wildcard index from Theorem 1 yields a new $k$-bounded wildcard index with linear query time using space $O(\sigma^{k^2} n \log^k \log n)$. This concludes the proof of Theorem 3.
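
The space bound can be checked by writing out the substitution. This is only a sketch of the arithmetic, assuming that the index $A$ of Theorem 1 uses $S = O(n)$ space and has query time $O(m + \sigma^j \log\log n + \mathrm{occ})$ with $j \le k$, as stated in the abstract, so that $\sigma^j \log\log n \le G$.

```latex
\begin{align*}
n \min(G, h)^k + S
  &\le n\,G^k + O(n) \\
  &= n \left(\sigma^k \log\log n\right)^k + O(n) \\
  &= O\!\left(\sigma^{k^2}\, n \log^k \log n\right).
\end{align*}
```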

7 Variable Length Gaps



We now consider the string indexing for patterns with variable length gaps problem. By only 
changing the search procedure, this problem can be solved using the previously described bounded 
and unbounded wildcard indexes. 

The string indexing for patterns with variable length gaps problem is to build an index for a 
string t that can efficiently report the occurrences of a query pattern p of the form 

$p = p_0 \,*\{a_1, b_1\}\, p_1 \,*\{a_2, b_2\}\, \cdots \,*\{a_j, b_j\}\, p_j$.

The query pattern consists of $j + 1$ strings $p_0, p_1, \ldots, p_j \in \Sigma^*$ interleaved by $j$ variable length gaps $*\{a_i, b_i\}$, $i = 1, \ldots, j$, where $a_i$ and $b_i$ are positive integers such that $a_i \le b_i$. Intuitively, a variable length gap $*\{a_i, b_i\}$ matches an arbitrary string over $\Sigma$ of length between $a_i$ and $b_i$, both inclusive.

Example 1. Consider the string $t$ and pattern $p$ over the alphabet $\Sigma = \{a, b, c, d\}$.

t = acbccbacccddabdaabcdccbccdaa

p = b*{0,4}cc*{3,5}d

The string $t$ contains five occurrences of the query pattern $p$ as shown in Figure 4.



As shown by Example 1, different occurrences of the query pattern $p$ can start or end at the same position in $t$, and the same substring in $t$ can contain multiple occurrences of $p$. Hence to completely characterize an occurrence of $p$ in $t$, we need to report the positions of the individual subpatterns $p_0, p_1, \ldots, p_j$ for each full occurrence of the pattern. However, in the following we will restrict our attention to reporting the start and end position of each occurrence of $p$ in $t$. For the above example, we would thus report the pairs (3, 11), (3, 15), (6, 15) and (18, 26).
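
The pairs in Example 1 can be reproduced with a brute-force matcher. The sketch below is not one of the indexes described in this paper; it simply tries every start position and every gap length, and the function name `vlg_occurrences` and its argument layout are hypothetical.

```python
def vlg_occurrences(t, subpatterns, gaps):
    """Yield a 1-based (start, end) pair for every occurrence of the pattern.

    subpatterns is [p_0, ..., p_j]; gaps is [(a_1, b_1), ..., (a_j, b_j)].
    Two occurrences that differ only in where an inner subpattern is placed
    are yielded separately, so the same pair can appear more than once.
    """
    def extend(i, pos):
        # pos is the 0-based index at which subpattern i has to start.
        p = subpatterns[i]
        if not t.startswith(p, pos):
            return
        end = pos + len(p)  # 0-based exclusive end equals 1-based inclusive end
        if i == len(subpatterns) - 1:
            yield end
            return
        a, b = gaps[i]
        for gap in range(a, b + 1):
            yield from extend(i + 1, end + gap)

    for start in range(len(t)):
        for end in extend(0, start):
            yield (start + 1, end)

t = "acbccbacccddabdaabcdccbccdaa"
occ = list(vlg_occurrences(t, ["b", "cc", "d"], [(0, 4), (3, 5)]))
print(len(occ))          # 5
print(sorted(set(occ)))  # [(3, 11), (3, 15), (6, 15), (18, 26)]
```

The pair (6, 15) is produced twice, once with the subpattern cc placed at positions 8-9 and once at positions 9-10, which is why five occurrences yield only four distinct pairs.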



7.1 Supporting Variable Length Gaps 

Recall that a variable length gap $*\{a_i, b_i\}$ is equivalent to $a_i$ wildcards followed by $b_i - a_i$ optional wildcards. Hence to support variable length gaps, we only have to describe how the search algorithms must be modified to match an optional wildcard in $p$. We simulate an optional wildcard as matching both a normal wildcard and the empty string. When matching a normal wildcard the search can only branch in explicit vertices, but for optional wildcards the search will always branch to at least two locations. This is the reason for the $2^{B-A}$ factor in the query times of Theorems 4 and 5.

To report the substrings in $t$ where the query pattern occurs, we assume that each leaf $\ell$ in $T(\mathrm{suff}(t))$ has been labeled by the start position, $\mathrm{pos}(\ell)$, of the suffix in $t$ it corresponds to. The search for $p$ terminates in a set of locations $\mathcal{R}$, each corresponding to one or more substrings in $t$ where the query pattern $p$ occurs. We can report the start and end position of these substrings by traversing the subtrees rooted in the locations of $\mathcal{R}$. For a subtree rooted in $\ell' \in \mathcal{R}$ we identify the leaves $\ell_0, \ell_1, \ldots, \ell_r$ corresponding to suffixes of $t$ having $\ell'$ as a prefix. The start and end positions of these substrings are then given by

$(\mathrm{pos}(\ell_0), \mathrm{pos}(\ell_0) + |\ell'|),\ (\mathrm{pos}(\ell_1), \mathrm{pos}(\ell_1) + |\ell'|),\ \ldots,\ (\mathrm{pos}(\ell_r), \mathrm{pos}(\ell_r) + |\ell'|)$.

acbccbacccddabdaabcdccbccdaa
  bcc*****d
  b****cc*****d
     b*cc*****d
     b**cc****d
                 b**cc***d

Figure 4: The five occurrences of the query pattern p = b*{0,4}cc*{3,5}d in the string t = acbccbacccddabdaabcdccbccdaa.

7.2 Analysis of the Modified Search 

To analyse the query time we bound the maximum number of LCP queries performed during the search for the query pattern

$p = p_0 \,*\{a_1, b_1\}\, p_1 \,*\{a_2, b_2\}\, \cdots \,*\{a_j, b_j\}\, p_j$.

We define $A_i = \sum_{l=1}^{i} a_l$ and $B_i = \sum_{l=1}^{i} b_l$. The number of normal and optional wildcards preceding the subpattern $p_i$ in $p$ is $A_i$ and $B_i - A_i$, respectively. To bound the number of locations in which an LCP query for the subpattern $p_i$ can start, we choose and promote $l = 0, 1, \ldots, B_i - A_i$ of the preceding optional wildcards to normal wildcards and discard the rest. For a specific choice there are exactly $A_i + l$ wildcards preceding $p_i$, and thus the number of locations in which an LCP query for $p_i$ can start is at most $\beta^{A_i + l}$. The term $\beta$ is an upper bound on the branching factor of the search when consuming a wildcard. For a suffix tree $T(\mathrm{suff}(t))$ the branching factor is $\beta = \sigma$, but indexes based on wildcard trees can have a smaller branching factor. There are $\binom{B_i - A_i}{l}$ possibilities for choosing the $l$ optional wildcards, so the number of locations in which an LCP query for $p_i$ can start is at most

$\sum_{l=0}^{B_i - A_i} \binom{B_i - A_i}{l} \beta^{A_i + l} \le 2^{B_i - A_i} \beta^{B_i}$.

Summing over the $j + 1$ subpatterns, we obtain a bound of $O(2^{B-A} \beta^B)$ on the number of LCP queries performed during a search for the query pattern $p$. Since LCP queries are performed in time $O(\log\log n)$ and we have to preprocess the pattern in time $O(m)$, the total query time becomes $O(m + 2^{B-A} \beta^B \log\log n + \mathrm{occ})$. This concludes the proof of Theorem 4 and Theorem 5.
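
To make the counting argument concrete, the per-subpattern bound can be evaluated numerically. The sketch below is only an illustration of the arithmetic; the function name `start_location_bounds` is hypothetical and the gaps are those of Example 1 with an arbitrarily chosen branching factor $\beta = 4$.

```python
from math import comb

def start_location_bounds(gaps, beta):
    """For i = 0..j return sum_l C(B_i - A_i, l) * beta**(A_i + l).

    gaps is [(a_1, b_1), ..., (a_j, b_j)]; A_i and B_i are the prefix sums
    of the a's and b's, with A_0 = B_0 = 0 for the first subpattern.
    """
    bounds, A, B = [], 0, 0
    for a, b in [(0, 0)] + list(gaps):
        A, B = A + a, B + b
        optional = B - A  # number of optional wildcards preceding p_i
        bounds.append(sum(comb(optional, l) * beta ** (A + l)
                          for l in range(optional + 1)))
    return bounds

gaps = [(0, 4), (3, 5)]
print(start_location_bounds(gaps, beta=4))  # [1, 625, 1000000]
```

For the last subpattern $A_2 = 3$ and $B_2 = 9$, so the closed-form bound $2^{B_2 - A_2} \beta^{B_2} = 2^6 \cdot 4^9 = 16{,}777{,}216$ comfortably dominates the exact sum $4^3 \cdot 5^6 = 1{,}000{,}000$.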

To show Theorem 6 we apply a black-box reduction very similar to Lemma 8, leading to a $(k, o)$-bounded optional wildcard index, where $k$ and $o$ are the maximum number of normal and optional wildcards allowed in the pattern, respectively. This index consists of the following two optional wildcard indexes. A query is performed on one of these indexes depending on the length $m + B$ of the query pattern $p$.

1. The unbounded optional wildcard index given by Theorem 4. This index has query time $O(m + 2^{B-A} \sigma^B \log\log n + \mathrm{occ})$ and uses space $O(n)$.

2. The $(k, o)$-bounded optional wildcard index obtained by using the wildcard tree $T_1^{k+o}(\mathrm{pref}_G(\mathrm{suff}(t)))$ without the LCP data structure, where $G = \sigma^{k+o} \log\log n$. For $\beta = 1$ the search for the subpattern $p_i$ can start from at most $2^{B_i - A_i}$ locations. Searching for $p_i$ from each of these locations takes time $O(|p_i| + b_i)$, since the LCP data structure is not used and the tree must be traversed one character at a time. Summing over the $j + 1$ subpatterns, we obtain the following query time for the index

$O\left(\sum_{i=0}^{j} 2^{B_i - A_i} (|p_i| + b_i) + \mathrm{occ}\right) = O\left(2^{B-A}(m + B) + \mathrm{occ}\right)$.

The index is a wildcard tree and by the same argument as for Theorem 3 it can be stored using space $O(nG^{k+o})$.

In case the query pattern $p$ has length $m + B > G$ we query the first index. It follows that $2^{B-A} \sigma^B \log\log n \le 2^{B-A} G < 2^{B-A}(m + B)$, so the query time is $O(2^{B-A}(m + B) + \mathrm{occ})$. If $p$ has length $m + B \le G$ all occurrences of $p$ in $t$ can be found by querying the second index in time $O(2^{B-A}(m + B) + \mathrm{occ})$. The space of the index is

$O(n + nG^{k+o}) = O\left(n (\sigma^{k+o} \log\log n)^{k+o}\right) = O\left(n \sigma^{(k+o)^2} \log^{k+o} \log n\right)$.

This concludes the proof of Theorem 6.

8 Conclusion

We have presented several new indexes supporting patterns containing wildcards and variable length gaps. All previous wildcard indexes have query times which are either exponential in the number of wildcards or gaps in the pattern, or linear in the length of the indexed text. We showed that it is possible to obtain an index with linear query time while avoiding space usage exponential in the length of the indexed string. Moreover, we gave an index with linear space usage and a fast query time. For wildcard indexes having a query time sublinear in the length of the indexed string, an interesting open problem is whether there is an index where neither the size nor the query time is exponential in the number of wildcards or gaps in the pattern.

References

[1] S. Alstrup, T. Husfeldt, and T. Rauhe. Marked ancestor problems. In Proc. 39th FOCS, pages 534-543, 1998.

[2] A. Amir, M. Lewenstein, and E. Porat. Faster algorithms for string matching with k mismatches. In Proc. 11th SODA, pages 794-803, 2000.

[3] P. Bille and I. L. Gørtz. Substring Range Reporting. In Proc. 22nd CPM, pages 299-308, 2011.

[4] P. Bille, I. L. Gørtz, H. Vildhøj, and D. Wind. String matching with variable length gaps. In Proc. 17th SPIRE, pages 385-394, 2010.

[5] P. Bucher and A. Bairoch. A generalized profile syntax for biomolecular sequence motifs and its function in automatic sequence interpretation. In Proc. 2nd ISMB, pages 53-61, 1994.

[6] H. L. Chan, T. W. Lam, W. K. Sung, S. L. Tam, and S. S. Wong. A linear size index for approximate pattern matching. J. Disc. Algorithms, 9(4):358-364, 2011.


[7] B. Chazelle. Filtering search: A new approach to query-answering. SIAM J. Comput.,
15(3):703-724, 1986.

[8] G. Chen, X. Wu, X. Zhu, A. Arslan, and Y. He. Efficient string matching with wildcards and 
length constraints. Knowl. Inf. Sys., 10(4):399-419, 2006. 

[9] P. Clifford and R. Clifford. Simple deterministic wildcard matching. Inf. Process. Lett., 
101(2):53-54, 2007. 

[10] L. Coelho and A. Oliveira. Dotted suffix trees a structure for approximate text indexing. In 
Proc. 13th SPIRE, pages 329-336, 2006. 

[11] R. Cole, L. Gottlieb, and M. Lewenstein. Dictionary matching and indexing with errors and 
don't cares. In Proc. 36th STOC, pages 91-100, 2004. 

[12] R. Cole and R. Hariharan. Approximate string matching: A simpler faster algorithm. In Proc. 
9th SODA, pages 463-472, 1998. 

[13] R. Cole and R. Hariharan. Verifying candidate matches in sparse and wildcard matching. In
Proc. 34th STOC, pages 592-601, 2002.

[14] M. J. Fischer and M. S. Paterson. String-Matching and Other Products. In Complexity of 
Computation, SIAM-AMS Proceedings, pages 113-125, 1974. 

[15] M. L. Fredman, J. Komlós, and E. Szemerédi. Storing a Sparse Table with O(1) Worst Case
Access Time. J. ACM, 31:538-544, 1984.

[16] K. Fredriksson and S. Grabowski. Efficient algorithms for pattern matching with general gaps,
character classes, and transposition invariance. Inf. Retr., 11(4):335-357, 2008.

[17] K. Fredriksson and S. Grabowski. Efficient algorithms for pattern matching with general gaps,
character classes, and transposition invariance. Inf. Retr., 11(4):335-357, 2008.

[18] K. Fredriksson and S. Grabowski. Nested counters in bit-parallel string matching. Proc. 3rd 
LATA, pages 338-349, 2009. 

[19] Z. Galil and R. Giancarlo. Improved string matching with k mismatches. ACM SIGACT 
News, 17(4):52-54, 1986. 

[20] D. Harel and R. Tarjan. Fast algorithms for finding nearest common ancestors. SIAM J.
Comput., 13(2):338-355, 1984.

[21] K. Hofmann, P. Bucher, L. Falquet, and A. Bairoch. The PROSITE database, its status in 
1999. Nucleic Acids Res., 27(1):215-219, 1999. 

[22] C. S. Iliopoulos and M. S. Rahman. Pattern matching algorithms with don't cares. In Proc. 
33rd SOFSEM, pages 116-126, 2007. 

[23] A. Kalai. Efficient pattern-matching with don't cares. In Proc. 13th SODA, pages 655-656, 
2002. 






[24] T. W. Lam, W. K. Sung, S. L. Tam, and S. M. Yiu. Space efficient indexes for string matching
with don't cares. In Proc. 18th ISAAC, pages 846-857, 2007.

[25] G. Landau and U. Vishkin. Efficient string matching with k mismatches. Theoret. Comput. 
Sci., 43:239-249, 1986. 

[26] G. Landau and U. Vishkin. Fast parallel and serial approximate string matching. J. Algorithms, 
10(2):157-169, 1989. 

[27] M. Lewenstein. Indexing with gaps. In Proc. 18th SPIRE, pages 135-143, 2011. 

[28] M. Maas and J. Nowak. Text indexing with errors. J. Disc. Algorithms, 5(4):662-681, 2007. 

[29] G. Mehldau and G. Myers. A system for pattern matching applications on biosequences. 
CABIOS, 9(3):299-314, 1993. 

[30] M. Morgante, A. Policriti, N. Vitacolonna, and A. Zuccolo. Structured motifs search. J. 
Comput. Bio., 12(8):1065-1082, 2005. 

[31] E. Myers. Approximate matching of network expressions with spacers. J. Comput. Bio., 
3(1):33-51, 1996. 

[32] G. Navarro, R. Baeza-Yates, E. Sutinen, and J. Tarhio. Indexing methods for approximate
string matching. IEEE Data Eng. Bull., 24(4):19-27, 2001.

[33] G. Navarro and M. Raffinot. Fast and simple character classes and bounded gaps pattern 
matching, with applications to protein searching. J. Comput. Bio., 10(6):903-923, 2003. 

[34] M. S. Rahman, C. S. Iliopoulos, I. Lee, M. Mohamed, and W. F. Smyth. Finding patterns 
with variable length gaps or don't cares. In Proc. 12th COCOON, pages 146-155, 2006. 

[35] S. Sahinalp and U. Vishkin. Efficient approximate and dynamic matching of patterns using a 
labeling paradigm. In Proc. 31th FOCS, pages 320-328, 1996. 

[36] A. Tam, E. Wu, T. Lam, and S. Yiu. Succinct text indexing with wildcards. In Proc. 16th
SPIRE, pages 39-50, 2009.

[37] D. Tsur. Fast index for approximate string matching. J. Disc. Algorithms, 8(4):339-345, 2010. 

[38] P. Weiner. Linear pattern matching algorithms. In Proc. 14th SWAT, pages 1-11, 1973. 






